-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dask in pyodide #9053
Dask in pyodide #9053
Conversation
cc @rth (who may find this of interest 🙂) |
Thanks for taking a look @jakirkham ! |
Yes, pragmatically this seems to be a reasonable first step to get Dask running in Pyodide.
Note that currently importing threading and multiprocessing would still work in Pyodide. Even part of the API works (e.g using locks). What would fail is actually starting a thread or process (or importing private modules). Or at least, this has been the idea in Pyodide so far, to maximize the number of packages that could be imported without error if they use e.g. multiprocessing somewhere. The reason you are able to catch import errors currently is because some of the deeper APIs would indeed raise an ImportError (and maybe we are not fully consistent about what raises an exception and what doesn't). in Pyodide repl
So there might be more reliable/standard ways of detecting that threading/multiprocessing is not available, and if you have feedback on how to improve the current situation, we would be interested in hearing it. |
Thanks for weighing in @rth!
This is exactly what I was seeing ( |
I suppose what I'm after (per the FAQs )is to check |
Would that platform tag change if/when the ecosystem moves to WASM? Seems like we would want this check in as few places as possible to keep updating/fixing when needed simple |
I opened https://discuss.python.org/t/expected-behavior-for-unsupported-stdlib-modules-in-the-browser/ to discuss it. It should be OK. As Christian mentioned, the best way to detect a platform is to use,
This works in Pyodide now and will work in the future. But then some Emscripten builds for Node.js do have threading enabled https://pythondev.readthedocs.io/wasm.html#targets (and Pyodide might also enable it in the future) so that's also not very reliable. So to detect,
Christian also proposed adding more specific feature-detection flags in the above-linked thread, but it would take a while to propagate until Pyodide, so if you want something working now, the current approach is probably the best choice. |
Thanks for the detailed information @rth, and thanks for opening the upstream discussion.
I pushed a change to check
Yeah, I think I was imprecise before, the problem isn't with the stdlib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @ian-r-rose. I still need to take a more detailed look at the changes here, but from a cursory look I don't see any tests. I'm not familiar with pyodide
-- is it straightforward to add tests that make sure things are working as expected?
It's not really straightforward as it's a completely different execution environment (and the defensive imports are around aspects of the standard library not being available). That said, I think I could monkeypatch Edit: on reflection, since this logic only happens on startup, it's tough to mock. I'm not sure I see a good way to test this behavior, would love to hear ideas. |
ipycytoscape later.
with regards to argument name, this just enforces it and allows for type checking of getters.
Okay, I've figured out a CI test for this behavior. It ain't pretty, but it works. Would love to get this in for 2022.6.0 @jrbourbeau |
dask/array/core.py
Outdated
if _EMSCRIPTEN: | ||
from dask import local | ||
|
||
DEFAULT_GET = local.get_sync | ||
else: | ||
from dask import threaded | ||
|
||
DEFAULT_GET = threaded.get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd to have this here. Should we move it to dask.local
instead?
Edit: This pattern repeats below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean. Are you objecting to repeating this clause for all the different collections? In principle, it might make sense to consolidate, though I've kept them collection-specific since Bag
has a different default scheduler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a bit worrying that Emscripten logic is leaking into other parts of the codebase like this.
Think we could do a bit better by containing with some changes in those schedulers and then logic here and elsewhere like
if _EMSCRIPTEN: | |
from dask import local | |
DEFAULT_GET = local.get_sync | |
else: | |
from dask import threaded | |
DEFAULT_GET = threaded.get | |
try: | |
from dask.threaded import get as DEFAULT_GET | |
except ImportError: | |
from dask.local import get_sync as DEFAULT_GET |
Maybe this could be handled other ways like having a centralized registry (dict
?) of different get
s and we could look for the appropriate one there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some suggestions below on handling the get
logic
dask/array/core.py
Outdated
if _EMSCRIPTEN: | ||
from dask import local | ||
|
||
DEFAULT_GET = local.get_sync | ||
else: | ||
from dask import threaded | ||
|
||
DEFAULT_GET = threaded.get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a bit worrying that Emscripten logic is leaking into other parts of the codebase like this.
Think we could do a bit better by containing with some changes in those schedulers and then logic here and elsewhere like
if _EMSCRIPTEN: | |
from dask import local | |
DEFAULT_GET = local.get_sync | |
else: | |
from dask import threaded | |
DEFAULT_GET = threaded.get | |
try: | |
from dask.threaded import get as DEFAULT_GET | |
except ImportError: | |
from dask.local import get_sync as DEFAULT_GET |
Maybe this could be handled other ways like having a centralized registry (dict
?) of different get
s and we could look for the appropriate one there
Thanks for the suggestion @jakirkham, this seems sensible to me. I had moved away from an |
Hmm, @jakirkham your suggestion is actually significantly more difficult to test in CI, since it appears that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ian-r-rose -- apologies for the delayed response
Another option that came up in this comment ( #9053 (comment) ) was using a |
I tried that as well, and wasn't really happy with it either, in that the mock just felt like a tautology. I'll push a commit for comparison. |
cf. 78766ad It's not a great test, but I suppose it could do |
Gotcha. This does feel cleaner. Think the testing point merely highlights it is hard to test this well outside of just running with Pyodide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a couple small follow up suggestions based on the last change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We should keep an eye on CI for a few days to seem if any of the keyword renames in functions cause any issues (though probably not)
Thanks for the review @jakirkham! |
Merging this ahead of the release on Friday. Thanks for working on this @ian-r-rose it seems like an exciting first step! |
This is a small step towards #7764 and dask/distributed#6257. It's basically just defensively importing
threading
andmultiprocessing
and defaulting to the synchronous scheduler if those fail. So I think this is currently mostly be useful for demos and training around the dask collections API. But it does work.This is distinct from actually getting a
distributed.Client
working and talking to a remote cluster, which will require some actual networking work.pre-commit run --all-files